Building Queryable Datasets from Ungrammatical and Unstructured Sources

Author

  • Matthew Jeremy Michelson
Abstract

For agents to act on behalf of users, they will have to query the vast amounts of textual data on the internet. However, much of this text cannot be queried because it is neither grammatical nor formally structured enough to support traditional information extraction approaches to annotation. Examples of such text, called “posts,” include item descriptions on eBay and listings on internet classifieds such as Craigslist. This work describes an approach to annotating posts that combines record linkage with information extraction. The approach leverages collections of known entities, called “reference sets,” by first aligning a post to a member of a reference set and then exploiting the matched member during information extraction. This thesis compares the approach to more traditional information extraction methods that rely on structural and grammatical characteristics, and shows that it outperforms those methods on this type of data.
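
The two steps named in the abstract (align a post to a reference-set member, then use the matched member to guide extraction) can be illustrated with a minimal sketch. This is not the thesis's actual method: the car reference set, the Jaccard token-overlap score, and the function names `match_post` and `extract` are all assumptions chosen for illustration, and a realistic system would also need to tolerate misspellings and abbreviations.

```python
import re

# Hypothetical reference set: known entities with clean attribute values.
REFERENCE_SET = [
    {"make": "Honda", "model": "Civic"},
    {"make": "Honda", "model": "Accord"},
    {"make": "Toyota", "model": "Corolla"},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_post(post, reference_set):
    """Record linkage step (simplified): return the reference record whose
    attribute values share the most tokens with the post, or None."""
    post_tokens = tokenize(post)

    def score(record):
        return jaccard(post_tokens, tokenize(" ".join(record.values())))

    best = max(reference_set, key=score)
    return best if score(best) > 0 else None


def extract(post, record):
    """Extraction step (simplified): label each post token that appears in
    an attribute value of the matched reference record."""
    labels = {}
    for token in tokenize(post):
        for attr, value in record.items():
            if token in tokenize(value):
                labels.setdefault(attr, []).append(token)
    return labels


post = "93 civic 5speed runs great obo"
record = match_post(post, REFERENCE_SET)
print(record)                 # matched reference-set member
print(extract(post, record))  # post tokens labeled by attribute
```

In this sketch the matched record also supplies a standardized value for an attribute the post never mentions (the make), which is one reason aligning to a reference set helps make such posts queryable.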

Similar resources

A Reference-set Approach to Information Extraction from Unstructured, Ungrammatical Data Sources

This thesis investigates information extraction from unstructured, ungrammatical text on the Web such as classified ads, auction listings, and forum postings. Since the data is unstructured and ungrammatical, this information extraction precludes the use of rule-based methods that rely on consistent structures within the text or natural language processing techniques that rely on grammar. Inste...

An Automatic Approach to Semantic Annotation of Unstructured, Ungrammatical Sources: A First Look

There exist numerous sources of data on the World Wide Web that contain useful information but are not structured or grammatical enough to support traditional information extraction. Furthermore, even if the information extraction could be done, the extracted values would need to be standardized to ensure the queries over the source are accurate. This paper presents an automatic, scalable appro...

Improving Clinical Diagnosis Inference through Integration of Structured and Unstructured Knowledge

This paper presents a novel approach to the task of automatically inferring the most probable diagnosis from a given clinical narrative. Structured Knowledge Bases (KBs) can be useful for such complex tasks but not sufficient. Hence, we leverage a vast amount of unstructured free text to integrate with structured KBs. The key innovative ideas include building a concept graph from both structure...

Creating Relational Data from Unstructured and Ungrammatical Data Sources

In order for agents to act on behalf of users, they will have to retrieve and integrate vast amounts of textual data on the World Wide Web. However, much of the useful data on the Web is neither grammatical nor formally structured, making querying difficult. Examples of these types of data sources are online classifieds like Craigslist and auction item listings like eBay. We call this unstructu...

Semantic Annotation of Online Ad Portals

Online classified ad portals have become very popular in recent times as they provide affordable and efficient advertising services to consumers and businesses and have a larger audience base when compared to traditional means of advertising. The ads on these portals, however, are typically posted by ordinary users in an unstructured, ungrammatical and (at times) incoherent manner which makes se...

Publication date: 2005